{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.5 Joint distributions and correlations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are often interested not in the distribution of a single variable but in the relationship between two or more variables. This requires us to understand the concepts of **joint distributions** and **correlation**. \n",
    "\n",
    "Returning to the BMI dataset, a high BMI is indicative of being overweight and this is likely to mean that an individual may have a high percentage of body fat. Typically, those individuals with high BMI may also be at risk of health conditions such as heart disease, which may be indicated by high blood pressure. \n",
    "\n",
    "If we wish to address questions relating to two or more variables, we need to understand their joint distribution.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.5.1. Joint distributions\n",
    "\n",
    "If we have two random variables $X$ and $Y$, the cumulative joint distribution function (CDF) is, \n",
    "\n",
    "$$F(x,y) = P(X \\leq x,Y \\leq y)$$\n",
    "\n",
    "regardless of whether $X$ and $Y$ are continuous or discrete. For continuous random variables the joint density function will be $f(x,y)$ and will be non-negative and \n",
    "\n",
    "$$\\int_{-\\infty}^{\\infty} \\int_{-\\infty}^{\\infty} f(x,y)\\: dy\\: dx = 1.$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.5.2 Marginal distributions\n",
    "\n",
    "We might sometimes want to think about the marginal density of, say, $X$. This means we want to know the probability of $X$ irrespective of $Y$, and consequently we will need to integrate over all possible values of $Y$. The marginal cdf of $X$, or $F_X$ is \n",
    "\n",
    "$$F_X (x) = P(X \\leq x)$$\n",
    "$$ = \\lim_{y \\rightarrow \\infty} F(x,y)$$\n",
    "$$ = \\int_{-\\infty}^{x} \\int_{-\\infty}^{\\infty} f(u,y)\\: dy\\: du$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this, it follows that the density function of $X$ alone, known as the **marginal density** of $X$, is\n",
    "\n",
    "$$f_x (x) = F_{X}'(x) = \\int_{-\\infty}^{\\infty} f(x,y)\\: dy$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this is different to assuming that $X$ is independent of $Y$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what does this mean in practical terms? Returning to the BMI data we can report that the average BMI ($\\mu_X$) is 26.46 and the average body fat percentage ($\\mu_Y$) is 35.31. If BMI and body fat were independent variables knowing BMI would tell us nothing about body fat and *vice versa*. But plotting the data (and some common sense) tells us that this is not the case; if we know one we can say quite a lot about the other. We could explore the correlation between the data (more about this later), but we can also describe these variables together using a joint distribution. By defining them using a joint distribution we are saying nothing about *cause and effect*, just that they are dependent variables. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th scope=col>Cond</th><th scope=col>Age</th><th scope=col>Wt</th><th scope=col>Wt2</th><th scope=col>BMI</th><th scope=col>BMI2</th><th scope=col>Fat</th><th scope=col>Fat2</th><th scope=col>WHR</th><th scope=col>WHR2</th><th scope=col>Syst</th><th scope=col>Syst2</th><th scope=col>Diast</th><th scope=col>Diast2</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><td>0    </td><td>43   </td><td>137  </td><td>137.4</td><td>25.1 </td><td>25.1 </td><td>31.9 </td><td>32.8 </td><td>0.79 </td><td>0.79 </td><td>124  </td><td>118  </td><td>70   </td><td>73   </td></tr>\n",
       "\t<tr><td>0    </td><td>42   </td><td>150  </td><td>147.0</td><td>29.3 </td><td>28.7 </td><td>35.5 </td><td>  NA </td><td>0.81 </td><td>0.81 </td><td>119  </td><td>112  </td><td>80   </td><td>68   </td></tr>\n",
       "\t<tr><td>0    </td><td>41   </td><td>124  </td><td>124.8</td><td>26.9 </td><td>27.0 </td><td>35.1 </td><td>  NA </td><td>0.84 </td><td>0.84 </td><td>108  </td><td>107  </td><td>59   </td><td>65   </td></tr>\n",
       "\t<tr><td>0    </td><td>40   </td><td>173  </td><td>171.4</td><td>32.8 </td><td>32.4 </td><td>41.9 </td><td>42.4 </td><td>1.00 </td><td>1.00 </td><td>116  </td><td>126  </td><td>71   </td><td>79   </td></tr>\n",
       "\t<tr><td>0    </td><td>33   </td><td>163  </td><td>160.2</td><td>37.9 </td><td>37.2 </td><td>41.7 </td><td>  NA </td><td>0.86 </td><td>0.84 </td><td>113  </td><td>114  </td><td>73   </td><td>78   </td></tr>\n",
       "\t<tr><td>0    </td><td>24   </td><td> 90  </td><td> 91.8</td><td>16.5 </td><td>16.8 </td><td>  NA </td><td>  NA </td><td>0.73 </td><td>0.73 </td><td> NA  </td><td> NA  </td><td>78   </td><td>76   </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|llllllllllllll}\n",
       " Cond & Age & Wt & Wt2 & BMI & BMI2 & Fat & Fat2 & WHR & WHR2 & Syst & Syst2 & Diast & Diast2\\\\\n",
       "\\hline\n",
       "\t 0     & 43    & 137   & 137.4 & 25.1  & 25.1  & 31.9  & 32.8  & 0.79  & 0.79  & 124   & 118   & 70    & 73   \\\\\n",
       "\t 0     & 42    & 150   & 147.0 & 29.3  & 28.7  & 35.5  &   NA  & 0.81  & 0.81  & 119   & 112   & 80    & 68   \\\\\n",
       "\t 0     & 41    & 124   & 124.8 & 26.9  & 27.0  & 35.1  &   NA  & 0.84  & 0.84  & 108   & 107   & 59    & 65   \\\\\n",
       "\t 0     & 40    & 173   & 171.4 & 32.8  & 32.4  & 41.9  & 42.4  & 1.00  & 1.00  & 116   & 126   & 71    & 79   \\\\\n",
       "\t 0     & 33    & 163   & 160.2 & 37.9  & 37.2  & 41.7  &   NA  & 0.86  & 0.84  & 113   & 114   & 73    & 78   \\\\\n",
       "\t 0     & 24    &  90   &  91.8 & 16.5  & 16.8  &   NA  &   NA  & 0.73  & 0.73  &  NA   &  NA   & 78    & 76   \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "| Cond | Age | Wt | Wt2 | BMI | BMI2 | Fat | Fat2 | WHR | WHR2 | Syst | Syst2 | Diast | Diast2 |\n",
       "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
       "| 0     | 43    | 137   | 137.4 | 25.1  | 25.1  | 31.9  | 32.8  | 0.79  | 0.79  | 124   | 118   | 70    | 73    |\n",
       "| 0     | 42    | 150   | 147.0 | 29.3  | 28.7  | 35.5  |   NA  | 0.81  | 0.81  | 119   | 112   | 80    | 68    |\n",
       "| 0     | 41    | 124   | 124.8 | 26.9  | 27.0  | 35.1  |   NA  | 0.84  | 0.84  | 108   | 107   | 59    | 65    |\n",
       "| 0     | 40    | 173   | 171.4 | 32.8  | 32.4  | 41.9  | 42.4  | 1.00  | 1.00  | 116   | 126   | 71    | 79    |\n",
       "| 0     | 33    | 163   | 160.2 | 37.9  | 37.2  | 41.7  |   NA  | 0.86  | 0.84  | 113   | 114   | 73    | 78    |\n",
       "| 0     | 24    |  90   |  91.8 | 16.5  | 16.8  |   NA  |   NA  | 0.73  | 0.73  |  NA   |  NA   | 78    | 76    |\n",
       "\n"
      ],
      "text/plain": [
       "  Cond Age Wt  Wt2   BMI  BMI2 Fat  Fat2 WHR  WHR2 Syst Syst2 Diast Diast2\n",
       "1 0    43  137 137.4 25.1 25.1 31.9 32.8 0.79 0.79 124  118   70    73    \n",
       "2 0    42  150 147.0 29.3 28.7 35.5   NA 0.81 0.81 119  112   80    68    \n",
       "3 0    41  124 124.8 26.9 27.0 35.1   NA 0.84 0.84 108  107   59    65    \n",
       "4 0    40  173 171.4 32.8 32.4 41.9 42.4 1.00 1.00 116  126   71    79    \n",
       "5 0    33  163 160.2 37.9 37.2 41.7   NA 0.86 0.84 113  114   73    78    \n",
       "6 0    24   90  91.8 16.5 16.8   NA   NA 0.73 0.73  NA   NA   78    76    "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "ename": "ERROR",
     "evalue": "Error in ggplot(dat, aes(x = BMI, y = Fat)): could not find function \"ggplot\"\n",
     "output_type": "error",
     "traceback": [
      "Error in ggplot(dat, aes(x = BMI, y = Fat)): could not find function \"ggplot\"\nTraceback:\n"
     ]
    }
   ],
   "source": [
    "options(repr.plot.width=4, repr.plot.height=3)\n",
    "\n",
    "# BMI dataset\n",
    "\n",
    "dat <- read.csv(\"Practicals/Datasets/BMI/MindsetMatters.csv\")\n",
    "head(dat)\n",
    "#remove observations with no BMI data\n",
    "dat <- dat[!is.na(dat$BMI),]\n",
    "# scatter plot of BMI and body fat\n",
    "ggplot(dat,aes(x=BMI,y=Fat)) + geom_point()\n",
    "\n",
    "# report the mean of each variable\n",
    "# note that some values of Y are missing...we need to add na.rm otherwise the estimate will be NA\n",
    "mux <- mean(dat$BMI)\n",
    "print(paste0(\"value of mu_x is \",round(mux,2)))\n",
    "muy <- mean(dat$Fat,na.rm=T)\n",
    "print(paste0(\"value of mu_y is \",round(muy,2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So this joint distribution has a joint cdf, $F(x,y)$ and a continuous piecewise density function $f(x,y)$. The joint mean is defined as $\\mu_x,\\mu_y$ What about the variance? Here we need to consider the variance and covaraince between $X$ and $Y$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# correlation between variables\n",
    "dat2 <- dat[!is.na(dat$Fat),]\n",
    "round(cov(x=cbind(dat2$BMI,dat2$Fat)),3)\n",
    "paste0(\"variance of BMI = \",round(var(dat2$BMI),3))\n",
    "paste0(\"variance of fat = \",round(var(dat2$Fat),3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The *covariance matrix* is returned. The diagnoals return the variance of each parameter, and the off-diagnoals the covariance, indicating a positive correlation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.5.3  Correlation\n",
    "\n",
    "Correlation and covariance are closely related.  Pearson's correlation coefficient is defined as:\n",
    "\n",
    "$$ \n",
    "\\rho(X,Y) = Corr(X,Y) = \\frac{Cov(X,Y)}{SD(X)SD(Y)}\n",
    "$$ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So this helps us define BMI from body fat and *vice versa*. Examples of when this might be useful include;\n",
    "* Inputing missing data\n",
    "* Summarising many variables with one metric (more about this in the Machine learning module)\n",
    "* Efficient sampling of distributions, which is used in Monte Carlo Markov Chain (MCMC) estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.5.4 Connections to regression modelling\n",
    "\n",
    "Later sessions exploring regression modelling will provide a powerful and flexible approach to exploring and quantifying *dependencies* between variables. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}